Author: Yichu Chen (916397593)
Analyzing car prices is a popular topic for data science projects. Well-known data sources such as Kaggle.com, Data.world and the UCI Machine Learning Repository already provide many datasets that contain car prices and common features like car model, mileage, performance, vehicle type, fuel efficiency, etc. These datasets, although convenient for quick analysis, are usually outdated and contain limited information. Many of their features are well suited for modeling but offer few useful, practical insights for used car buyers (for instance, knowing the exact weight and size of a car might help to predict the price, but it is not that meaningful for actual buyers). In this project, I apply web-scraping techniques to retrieve and analyze used car data based on the most recent car listings from Cars.com. Exploratory data analysis based on the features provided by Cars.com is carried out to visualize findings that may give prospective car buyers more relevant information about used car prices.
The dataset used in this project comes from Cars.com and includes all used car listings within 50 miles of Davis, CA 95616. After dropping listings with a missing price or make/model, there are a total of 9976 observations. To obtain this dataset, I first scraped all listing pages to collect the URL of each car, and then made requests to scrape the useful data directly from each car's own page (in my tests this turned out, somewhat surprisingly, to be more efficient). To improve efficiency, threading is also used to loop through all the cars' URLs. The data provided by Cars.com are not enough to explore all of the questions I am interested in. For instance, they do not include the countries of origin of the car brands or horsepower/displacement information. To obtain countries of origin, I additionally scraped another website (https://www.canstarblue.com.au/vehicles/car-country-of-origin/). I also made requests, based on each car's VIN, to a publicly available car information API provided by the US Department of Transportation (https://vpic.nhtsa.dot.gov/api). One big advantage of the API is that it provides many more detailed and useful features. The API also helped me standardize string values for certain features. For example, it would not be easy to reconcile variants such as "BMW 3", "3 Series" and "325i" based on data from Cars.com alone, even with the help of regular expressions. For web scraping, I mainly used requests, lxml and BeautifulSoup, which helped me navigate various complex HTML tree structures and fetch the desired information. For preprocessing and data integration, Pandas and NumPy are my main tools. For visualization, I made extensive use of the Plotly library, because its interactive tools automatically generate the JavaScript code for the plots in HTML, so that sliders and drop-down menus can be used without creating additional Python files or relying on web platforms to host the plots.
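The threaded fetching step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the helper names `fetch_all` and `vin_decode_url` are hypothetical, and the `DecodeVinValues` path is an assumption about which vPIC endpoint is called (the text only gives the API's base URL).

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

# Assumed vPIC endpoint for decoding a single VIN into flat JSON.
VPIC_DECODE = "https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/{vin}?format=json"


def vin_decode_url(vin):
    """Build the (assumed) vPIC decode URL for one VIN."""
    return VPIC_DECODE.format(vin=quote(vin.strip().upper()))


def fetch_all(urls, fetch, max_workers=8):
    """Apply fetch(url) to every URL using a pool of threads.

    Returns a dict mapping each URL to its result, or to None if the
    fetch raised, so that one bad listing does not abort the whole run.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

In practice, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. Passing it in as an argument keeps the threading logic separate from the network call, which also makes it easy to try out with a stub function.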
Let's first take a look at the overall price distribution. The following plot shows the distribution of used car prices (log-transformed) by the cars' countries of origin (i.e. the country in which the brand was founded). The curves are kernel density estimates based on the empirical probabilities of the prices (i.e. counts normalized by the total count). Differences in the price distributions can be easily spotted in this plot.
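As a rough illustration of how such a density curve can be computed, here is a small Gaussian kernel density estimate on log-prices using only the standard library. The prices are made up, and the bandwidth rule (the normal-reference rule of thumb) is an assumption; the curves in the plot may well use a different kernel or bandwidth.

```python
import math
import statistics


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each point in grid.

    If no bandwidth is given, the normal-reference rule of thumb
    1.06 * sd * n^(-1/5) is used (an assumption for this sketch).
    """
    n = len(values)
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Made-up prices in USD for one hypothetical group of listings.
prices = [8500, 12000, 15500, 21000, 24000, 33000, 58000]
log_prices = [math.log(p) for p in prices]

# Evaluate the density on an evenly spaced grid of log-prices.
lo, hi = min(log_prices) - 1, max(log_prices) + 1
grid = [lo + i * (hi - lo) / 199 for i in range(200)]
density = gaussian_kde(log_prices, grid)
```

Repeating this for each country of origin and overlaying the resulting curves gives a plot of the kind described above.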
One might also be interested in whether the plant country of a car, i.e. where the car was originally manufactured, makes a difference. A common belief is that imported cars are generally more expensive than locally produced cars. The answer, as it turns out, also depends on the car brand. To reduce the confounding effect of model year and mileage, we select only the cars whose model year is later than 2019 and whose total mileage is less than 80k. The results are shown below.
Next, let's turn our attention to car brands, which may also have a big impact on used car prices. For instance, Toyota is widely known for making cars that hold their value really well. In addition, with the rise of electric cars, Tesla vehicles can be quite popular and expensive even in the used car market. Let us explore those factors using the following visualization!
Here, one thing to note is that there could be confounding effects. For instance, SUVs and performance cars are generally more expensive than sedans. In addition, different car models have different MSRPs. These factors make it infeasible (or inaccurate) to compare the mean prices of the brands directly, since the proportions of car types and models differ from brand to brand. In the following, instead of mean prices, I propose to use weighted mean prices. Denoting by $X_i$ the prices of the individual vehicles belonging to some brand (so $i = 1, \dots, n$ indexes the cars of that brand) and by $w_i$ the corresponding weights, the weighted mean we use is calculated as:
$$\text{Weighted Mean} = \frac{\sum_{i=1}^n w_i X_i}{\sum_{i=1}^n w_i}$$

where the $w_i$ are essentially inverse probability weights for the different models in each brand. For instance, if $X_i$ is the used car price of a Tesla Model 3, then $w_i = (\hat{p}_{\text{Model 3}})^{-1}$, with $\hat{p}_{\text{Model 3}}$ being the empirical probability of Model 3 among all Tesla vehicles. This way, the disproportionate representation of different car models within each brand is better addressed. Since car models and car types are closely interrelated, I did not do further weighting based on car type. For visualization, the square roots of the weighted means are plotted.
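The weighted mean defined above can be sketched in a few lines. This is a toy illustration with made-up prices for a hypothetical brand, and the function name `ipw_weighted_mean` is my own, not taken from the project code.

```python
from collections import Counter


def ipw_weighted_mean(prices, models):
    """Weighted mean price for one brand with inverse-probability weights.

    Each car gets weight 1 / p_hat(model), where p_hat(model) is the share
    of that model among the brand's listings.
    """
    n = len(prices)
    counts = Counter(models)
    weights = [n / counts[m] for m in models]  # 1 / (count / n)
    return sum(w * x for w, x in zip(weights, prices)) / sum(weights)


# Made-up listings for a hypothetical brand: three of model "A", one of "B".
prices = [30000, 32000, 34000, 80000]
models = ["A", "A", "A", "B"]

plain = sum(prices) / len(prices)              # ordinary mean: 44000
weighted = ipw_weighted_mean(prices, models)   # weighted mean: about 56000
```

With these weights, the result equals the simple average of the per-model mean prices (here, the average of 32000 and 80000), so a model with many listings no longer dominates the brand-level figure.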
Now let's make some observations from the plot below:
Rivian, Ferrari, Lamborghini, Rolls-Royce and McLaren clearly stand out as Tier 1 brands (in terms of how expensive they are on average), followed by Mercedes, Porsche, Tesla, Ram, Lucid, Aston Martin and Lotus as Tier 2 brands.
We all know that year and mileage are both important factors for used car prices. One question of interest might be: does the relationship between price and mileage change over time? From this plot, one may quickly observe that, even after log-transformation, the price still exhibits a quadratic pattern (decaying faster at the beginning) when the model year is between 2022 and 2024. For cars whose model years are prior to 2022, the relationship tends to be linear. This is consistent with common sense: new cars depreciate faster than old cars, and in general the price decays exponentially with mileage (i.e. at a slower pace when mileage is high).
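To make the link between a linear trend in log-price and exponential decay explicit, here is a short derivation. It assumes a simple fitted line with a hypothetical intercept $a$ and slope $-b$ (with $b > 0$); these symbols are illustrative and are not estimates from the data.

$$\log(\text{Price}) = a - b \cdot \text{Mileage} \;\Longrightarrow\; \text{Price} = e^{a} e^{-b \cdot \text{Mileage}}, \qquad \frac{d\,\text{Price}}{d\,\text{Mileage}} = -b \cdot \text{Price}$$

Under this assumption, the dollar loss per additional mile is proportional to the current price, so it shrinks as mileage accumulates and the price falls.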
The relationship between price and consumer ratings of car models is not so clear. Still, one can observe some patterns:
The above plot shows us that, for instance, there is no clear association between price and a car model's performance rating. But is this really the case? Let's take a look at the relationship between price and a few other performance figures, including horsepower and displacement (L). Of course, we'd also like to group the data by model year to avoid the confounding effect of model year.
The first plot below (purple!) shows that the log-price is linearly associated with horsepower (hp). This suggests that prices tend to increase exponentially as horsepower increases. The second plot reveals how displacement is associated with price. Unlike for horsepower, this association is not very strong. Still, cars with higher displacement values (either performance cars or trucks/SUVs) tend to be more expensive. I suspect that the association between displacement and price might be stronger for new cars...
Now (and this is my favorite part), let's study some additional features, including entertainment, safety and convenience features. The following bar plots show the proportions of vehicles with different car features for each model year since 2000. One can adjust the slider to see when certain features (such as Bluetooth and CarPlay/Android Auto) started to emerge and become more popular.
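The proportions behind such bar plots can be computed along the following lines. This is a minimal sketch with made-up listings; the data layout (a model year paired with a set of feature labels) and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def feature_share_by_year(listings, feature):
    """Share of listings per model year that advertise a given feature.

    listings is an iterable of (model_year, set_of_features) pairs.
    """
    total = defaultdict(int)
    with_feature = defaultdict(int)
    for year, features in listings:
        total[year] += 1
        if feature in features:
            with_feature[year] += 1
    return {year: with_feature[year] / total[year] for year in sorted(total)}


# Made-up listings; the feature names are hypothetical labels.
listings = [
    (2012, {"Bluetooth"}),
    (2012, set()),
    (2020, {"Bluetooth", "CarPlay/Android Auto"}),
    (2020, {"Bluetooth"}),
]
share = feature_share_by_year(listings, "Bluetooth")
```

Computing one such dictionary per feature gives one bar per feature for each position of the year slider.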
In terms of entertainment features:
Lastly, let's explore whether a car's history significantly changes its price. In the following, we conduct two-sample t-tests on car prices after grouping by different categories. Hopefully the normality assumption holds reasonably well after the log-transformation... We use $\alpha=0.05$ as our significance level and test the null hypothesis $H_0: \mu_1 = \mu_2$ against a two-sided alternative.
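For reference, a two-sample test of this kind can be sketched with the standard library alone. Two assumptions are made here that the text does not confirm: the unequal-variance (Welch) form of the statistic is used, and the p-value comes from a standard normal approximation to the t distribution, which is only reasonable for large groups. The log-prices below are made up, and the group names are hypothetical.

```python
import math
import statistics


def welch_t_test(a, b):
    """Welch two-sample t statistic with an approximate two-sided p-value.

    The p-value uses a standard normal approximation to the t distribution,
    so it should only be trusted when both groups are large.
    """
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom, reported for reference.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return t, df, p


# Made-up log-prices; real groups would contain thousands of listings.
one_owner = [10.1, 10.3, 10.2, 10.4, 10.5, 10.2]
multi_owner = [9.8, 10.0, 9.9, 10.1, 9.7, 10.0]

t, df, p = welch_t_test(one_owner, multi_owner)
reject = p < 0.05  # compare against the significance level alpha = 0.05
```

With groups as small as in this toy example, an exact t-distribution p-value (for instance from a statistics library) would be the better choice; the normal approximation is used here only to keep the sketch dependency-free.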
About 40% of the used vehicles in our sample have had only one prior owner. A two-sample t-test suggests that there is a significant difference between the two categories (i.e. one previous owner or not), with the boxplot suggesting that having only one owner tends to push the price a bit higher. Of course, this result might change when we group the data by model year and mileage.
About one-fourth of the cars in our sample have had at least one accident reported. The confounding effect of year and mileage should be weaker on this variable. From the t-test we see that there is a significant difference between the car prices of the two categories.
While damage and accidents will almost surely affect a car's overall "quality" and reliability for future owners, an open recall is not necessarily concerning. Many open recalls are simply related to software updates, and in today's car market cars are rarely recalled for critical conditions (a reasonable guess is that such cases cannot account for the recalls on as many as 13.8% of cars). Still, the t-test tells us that there is a significant difference, suggesting that cars with at least one open recall may depreciate.
In this small project, we have explored various factors that may influence used car prices in today's car market. Most of our conclusions are consistent with common sense. In the analysis, I have tried my best to control for possible confounders (mainly year and mileage) by grouping the data by country, brand and model year, and by using a weighting technique to achieve better balance in the underlying covariates in each plot. Thanks to the interactive features, users can also explore their own questions of interest. There are many more stories that could be told, possibly with the help of statistical models such as OLS and LMM. In addition, using similar web-scraping techniques, the same set of procedures can be applied to car data from other car sales platforms, such as CarGurus and Kelley Blue Book. It might be interesting to see whether some platforms tend to overprice their cars.
Lastly, I would like to thank Prof. Kramlinger and TA Sophia for sharing their knowledge of web scraping and guiding us through this fun course. I would give it a solid 10/10 in terms of class quality!